Abstract: Many real-world applications use linear data
structures, such as strings or vectors. A linear data type may
omit information at its edges, especially for flow data. In this
paper, we present a ring representation technique for data. Our
experimental results on flow-based network data show that the
new approach achieves prominent classification rates.
intrusion, immune system.

I. Introduction 

As is well known, many applications use two types of linear
data representation: strings and real-valued vectors. In both
popular types, the representation is a linear structure of
symbols or numbers. Such structures may omit information at
their edges (the beginning and the end), which can reduce
classification rates.

Our idea for a new data representation originates from an
earlier empirical implementation on binary ring-based strings.
With our ring-type data representation, both the detection rate
and the accuracy rate are higher than those of the linear
representation, while the false alarm rates are quite similar.
We therefore use ring structures instead of linear ones for more
exact classification.

In this paper, Artificial Immune System (AIS) [1], a 
multidisciplinary research area that combines the principles 
of immunology and computation, is used for experiments on 
the proposed representation. 

AIS is inspired by observations of the behavior and interaction
of the normal components of biological systems (the self) and
the abnormal ones (the nonself). The Positive Selection
Algorithm (PSA) is a popular AIS model designed mainly for
one-class learning problems such as anomaly detection.

The outline of a typical PSA contains two stages [1]. In the
generation stage (Fig. 1), the detectors are generated by some
random process and censored by trying to match given self
samples taken from set S. Those candidates that match are kept
as detectors in set D and the rest are eliminated. In the
detection stage (Fig. 2), the collection of detectors (or
detector set) is used to verify whether an incoming data
instance is self or nonself. If it matches any detector, it is
claimed as self; otherwise it is nonself, or an anomaly. This
description is limited to some extent, but conveys the essential
idea.
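The two stages can be sketched in a few lines of Python. This is only an illustrative sketch: the Hamming-distance matching rule, the parameter names, and the random candidate generator are assumptions made for the example and are not the r-chunk matching used later in this paper.

```python
import random


def hamming_match(detector, sample, max_distance):
    """Assumed matching rule: at most max_distance differing positions."""
    return sum(a != b for a, b in zip(detector, sample)) <= max_distance


def generate_detectors(self_samples, length, n_candidates, max_distance, seed=0):
    """Generation stage: keep random candidates that match a self sample."""
    rng = random.Random(seed)
    detectors = set()
    for _ in range(n_candidates):
        candidate = "".join(rng.choice("01") for _ in range(length))
        if any(hamming_match(candidate, s, max_distance) for s in self_samples):
            detectors.add(candidate)  # matching candidates are kept in D
    return detectors


def classify(instance, detectors, max_distance):
    """Detection stage: self if any detector matches, otherwise nonself."""
    if any(hamming_match(d, instance, max_distance) for d in detectors):
        return "self"
    return "nonself"
```

In positive selection the detectors describe the self space, so an instance that no detector matches is reported as an anomaly.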



Fig. 1. Model of detector generation in PSA 



Fig. 2. Detection of new instances in PSA 
II. Basic Terms and Definitions

In PSAs, an essential component is the matching rule, which
determines the similarity between detectors and self samples
(in the detector generation phase) and incoming data instances
(in the detection phase). The matching rule obviously depends
on the detector representation. In this paper, both self and
nonself cells are represented as binary strings of fixed
length. This is the simplest and most popular representation of
data in AISs, and other representations (such as real-valued
ones) can be reduced to binary [15, 16].

2.1 Strings

An alphabet $\Sigma$ is a nonempty, finite set of symbols. A
string $s \in \Sigma^{*}$ is a sequence of symbols from
$\Sigma$, and its length is denoted by $|s|$. A string is called
the empty string if its length equals 0. Given an index
$i \in \{1, 2, \ldots, |s|\}$, $s[i]$ is the symbol at position
$i$ in $s$. Given two indices $i$ and $j$, whenever $j \geq i$,
$s[i \ldots j]$ is the substring of $s$ with length $j - i + 1$
that starts at position $i$, and if $j < i$, then
$s[i \ldots j]$ is the empty string.

We use ring structures instead of linear ones for more exact
classification. A simple way to do this is to concatenate each
string with its first k bits. Each new linear string is a ring
representation of its original one. Fig. 3 shows a ring
representation (b) and its original string (a) with k = 3.

0010100111        0010100111001

   (a)                 (b)

Fig. 3. A ring-based representation (b) of a string (a)

Given a set of strings $S \subseteq \Sigma^{\ell}$, the set
$S_r \subseteq \Sigma^{\ell + r - 1}$ contains the ring
representations of all strings in $S$, obtained by concatenating
each string $s \in S$ with its first $r - 1$ bits.
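The construction of the ring representation is short enough to write out directly. The following Python sketch assumes strings are ordinary Python `str` values; the function names are chosen for illustration only.

```python
def ring_representation(s, r):
    """Append the first r - 1 symbols of s to s itself."""
    if not 1 <= r <= len(s):
        raise ValueError("r must satisfy 1 <= r <= len(s)")
    return s + s[: r - 1]


def ring_set(strings, r):
    """Ring representations of all strings in a set (the set S_r)."""
    return {ring_representation(s, r) for s in strings}


# The string of Fig. 3 with k = r - 1 = 3 appended bits.
print(ring_representation("0010100111", 4))  # 0010100111001
```

Every string of length $\ell$ becomes a string of length $\ell + r - 1$, so each of the $\ell$ start positions has a complete window of $r$ symbols, including the windows that wrap around the end.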

Note that the idea of ring strings can easily be applied to
other data representations in AIS. One way to do this, for
instance, is to build other structures, such as trees or
automata, from the set $S_r$ instead of $S$ as usual. Our
approach can be implemented on any finite alphabet, but the
strings used in all examples are binary, $\Sigma = \{0, 1\}$,
for ease of understanding.

2.2 R-chunk detectors 

Given a set of strings $S \subseteq \Sigma^{\ell}$, a tuple
$(d, i)$ of a string $d \in \Sigma^{r}$, $r \leq \ell$, and an
integer $i \in \{1, \ldots, \ell\}$ is called a (positive)
r-chunk detector if there exists an $s \in S_r$ such that $d$
matches $s[i \ldots i + r - 1]$. We also use the notation
$S_i = \{(d, i) \mid (d, i) \text{ is a positive r-chunk detector}\}$
for the set of all positive r-chunk detectors at position $i$
with respect to $S_r$, $i = 1, \ldots, \ell$.

Example 1. Let $\ell = 5$ and the matching threshold $r = 3$.
Suppose that we have the set of four strings
$S = \{s_1 = 00000;\ s_2 = 10110;\ s_3 = 10111;\ s_4 = 11111\}$.
Then $S_r = \{0000000;\ 1011010;\ 1011110;\ 1111111\}$, and
$S_1 = \{(000,1); (101,1); (111,1)\}$,
$S_2 = \{(000,2); (011,2); (111,2)\}$,
$S_3 = \{(000,3); (110,3); (111,3)\}$,
$S_4 = \{(000,4); (101,4); (111,4)\}$,
$S_5 = \{(000,5); (010,5); (110,5); (111,5)\}$.
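Example 1 can be reproduced mechanically. The sketch below, with an illustrative function name, collects the chunks of length $r$ at every 1-based position of the ring representations and prints the detector sets for the four strings of the example.

```python
def chunk_detector_sets(strings, r):
    """Map each 1-based position i to the set of r-chunks found there."""
    lengths = {len(s) for s in strings}
    if len(lengths) != 1:
        raise ValueError("all strings must have the same length")
    ell = lengths.pop()
    ring = [s + s[: r - 1] for s in strings]  # ring representations
    return {
        i: {s[i - 1 : i - 1 + r] for s in ring}
        for i in range(1, ell + 1)
    }


S = ["00000", "10110", "10111", "11111"]
sets = chunk_detector_sets(S, r=3)
for i in sorted(sets):
    print(i, sorted(sets[i]))
# 1 ['000', '101', '111']
# 2 ['000', '011', '111']
# 3 ['000', '110', '111']
# 4 ['000', '101', '111']
# 5 ['000', '010', '110', '111']
```

Positions 4 and 5 only exist because of the ring representation: in the linear strings of length 5, a chunk of length 3 cannot start there.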

Given a data set $S$ of strings with the same $n$ features, the
length of the binary representation of each string is calculated
by the equation $\ell = \sum_{i=1}^{n} k_i$, where $k_i$ is the
smallest integer such that $|D_i| \leq 2^{k_i}$ and $D_i$ is the
value domain of feature $i$.
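As a small worked illustration of this length formula, the sketch below computes $k_i$ and $\ell$ for three hypothetical feature domains and encodes one record. The encoding itself (the index of each value in its sorted domain, written with $k_i$ bits) is an assumption made for the example; the paper does not fix a particular mapping.

```python
def bits_needed(domain_size):
    """Smallest integer k such that domain_size <= 2**k."""
    if domain_size < 1:
        raise ValueError("a feature domain must be nonempty")
    return (domain_size - 1).bit_length()


def binary_length(domains):
    """Total string length: the sum of k_i over all features."""
    return sum(bits_needed(len(d)) for d in domains)


def encode(record, domains):
    """Assumed encoding: index of each value in its sorted domain, k_i bits."""
    parts = []
    for value, domain in zip(record, domains):
        k = bits_needed(len(domain))
        index = sorted(domain).index(value)
        parts.append(format(index, "b").zfill(k) if k else "")
    return "".join(parts)


# Hypothetical domains with 3, 2 and 5 values: k = 2, 1 and 3 bits.
domains = [["icmp", "tcp", "udp"], [0, 1], [0, 1, 2, 3, 4]]
print(binary_length(domains))            # 6
print(encode(("tcp", 1, 4), domains))    # 011100
```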

2.3 Proposed algorithms 

Given $\ell$, $r$, a normal dataset $N \subseteq \Sigma^{\ell}$,
and an abnormal dataset $A \subseteq \Sigma^{\ell}$, Algorithm 1
creates the trees used in Algorithm 2. A new data instance
$s \in \Sigma^{\ell}$ is detected as self or nonself by
Algorithm 2.

Algorithm 1: Algorithm to generate trees

procedure TreesGeneration($N$, $r$, $T_A$)

    Input: A set of self strings $N \subseteq \Sigma^{\ell}$, a
    matching threshold $r \in \{1, \ldots, \ell\}$.

    Output: A set $T_A$ of $\ell - r + 1$ prefix trees.

    for $i = 1, \ldots, \ell$ do
        Create an empty tree $T_{N_i}$
    for all $s \in N_r$ do
        for $i = 1, \ldots, \ell$ do
            insert $s[i \ldots i + r - 1]$ into $T_{N_i}$
    for $i = 1, \ldots, \ell$ do
        Create an empty tree $T_{A_i}$
    for all $s \in A_r$ do
        for $i = 1, \ldots, \ell$ do
            insert $s[i \ldots i + r - 1]$ into $T_{A_i}$
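A minimal Python sketch of the tree-generation step is given below. It is an illustration under stated assumptions, not the authors' implementation: each prefix tree is a nested dictionary, the reserved key "#" (assumed not to be a symbol of the alphabet) stores at each leaf how many times the chunk was inserted, and one tree is built for every position $i = 1, \ldots, \ell$, following the loop bounds of Algorithm 1. The function is called once for the normal set and once for the abnormal set.

```python
def insert_chunk(tree, chunk):
    """Insert a chunk into a nested-dict prefix tree and count it at the leaf."""
    node = tree
    for symbol in chunk:
        node = node.setdefault(symbol, {})
    node["#"] = node.get("#", 0) + 1  # assumed leaf payload: a frequency


def leaf_count(tree, chunk):
    """Number of times chunk was inserted into tree (0 if absent)."""
    node = tree
    for symbol in chunk:
        node = node.get(symbol)
        if node is None:
            return 0
    return node.get("#", 0)


def trees_generation(strings, r):
    """Build one prefix tree per position from the ring representations."""
    if not strings:
        return []
    ell = len(strings[0])
    trees = [{} for _ in range(ell)]
    for s in strings:
        s_r = s + s[: r - 1]  # ring representation of s
        for i in range(ell):
            insert_chunk(trees[i], s_r[i : i + r])
    return trees


t_n = trees_generation(["00000", "10110", "10111", "11111"], r=3)
print(len(t_n), leaf_count(t_n[0], "101"), leaf_count(t_n[0], "010"))  # 5 2 0
```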

Algorithm 2: Algorithm PSA to detect if a new data instance
$s \in \Sigma^{\ell}$ is self or nonself.

procedure PSA($N$, $A$, $s$, $r$, $T_N$, $T_A$)

    Input: A set of self strings $N \subseteq \Sigma^{\ell}$, a
    set of nonself strings $A \subseteq \Sigma^{\ell}$, an
    unlabeled string $s \in \Sigma^{\ell}$, a matching threshold
    $r \in \{1, 2, \ldots, \ell\}$.

    Output: A set $T_A$ of $\ell - r + 1$ prefix trees
    representing nonself trees, a set $T_N$ of $\ell - r + 1$
    prefix trees representing self trees, a label of $s$
    (self or nonself).

    TreesGeneration($N$, $r$, $T_A$)
    $d_1 = d_2 = d_3 = 0$
    Create a string $s_r$ as the ring representation of $s$
    for $i = 1, \ldots, \ell$ do
        $s' = s_r[i \ldots i + r - 1]$
        if $s' \notin T_{N_i}$ then
            $d_1 = d_1 + 1$
        if $s' \notin T_{A_i}$ then
            $d_2 = d_2 + 1$
        if Leaf($s'$, $T_{N_i}$) $<$ Leaf($s'$, $T_{A_i}$) $\cdot\, t_1$ then
            $d_3 = d_3 + 1$
    if $d_1 > t_2$ then output $s$ is nonself
    else if $d_2 > t_3$ then output $s$ is self
    else if $d_3 \cdot t_4 > \ell$ then output $s$ is nonself
    else output $s$ is self
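The decision procedure of Algorithm 2 can be sketched as follows. Several points are assumptions made for illustration: the thresholds $t_1, \ldots, t_4$ are passed in as plain parameters because their values are not given here, Leaf is interpreted as the number of training strings that contain the chunk at that position, and per-position `Counter` objects stand in for the prefix trees, since they answer the same membership and frequency queries.

```python
from collections import Counter


def build_counts(strings, r):
    """Per-position chunk counts over ring representations (tree stand-in)."""
    ell = len(strings[0])
    counts = [Counter() for _ in range(ell)]
    for s in strings:
        s_r = s + s[: r - 1]
        for i in range(ell):
            counts[i][s_r[i : i + r]] += 1
    return counts


def psa_classify(normal, abnormal, s, r, t1, t2, t3, t4):
    """Label s as 'self' or 'nonself' following the steps of Algorithm 2."""
    t_n = build_counts(normal, r)
    t_a = build_counts(abnormal, r)
    ell = len(s)
    s_r = s + s[: r - 1]
    d1 = d2 = d3 = 0
    for i in range(ell):
        chunk = s_r[i : i + r]
        if chunk not in t_n[i]:
            d1 += 1  # chunk never seen in normal data
        if chunk not in t_a[i]:
            d2 += 1  # chunk never seen in abnormal data
        if t_n[i][chunk] < t_a[i][chunk] * t1:
            d3 += 1  # chunk is relatively more frequent in abnormal data
    if d1 > t2:
        return "nonself"
    if d2 > t3:
        return "self"
    if d3 * t4 > ell:
        return "nonself"
    return "self"


normal = ["00000", "10110", "10111", "11111"]
abnormal = ["01010", "01001"]
print(psa_classify(normal, abnormal, "10110", 3, t1=1, t2=2, t3=2, t4=2))  # self
print(psa_classify(normal, abnormal, "01010", 3, t1=1, t2=2, t3=2, t4=2))  # nonself
```

The third test only matters when a string looks familiar to both sets; it then decides by relative frequency rather than by presence alone.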

III. Experiments

3.1 Datasets 

In our experiments, we use two popular flow-based datasets:
NetFlow [31] and TU [36]. The flow-based NetFlow dataset is
generated from the packet-based DARPA dataset [38]. It focuses
only on flows to a specific port and an IP address that receives
the largest number of attacks. It contains all 129,571 traffic
flows (including attacks) to and from the victims. Each flow in
the datasets has 10 fields: Source IP, Destination IP, Source
Port, Destination Port, Packets, Octets, Start Time, End Time,
Flags, and Proto.

The other labeled flow-based dataset was captured by monitoring
a honeypot hosted in the University of Twente network, so we
call it the TU dataset. This dataset has three categories:
malicious, unknown, and side-effect. It has 14,170,132 flows,
most of which are malicious in nature.

The NetFlow dataset is used for experiment 1 with the features
Packets, Octets, Duration, Src port, Dst port, Flags, and IP
protocol. The TU dataset is used for experiment 2 with the
features Packets, Octets, Duration, and Flags.

3.2 Performance 

Table 1 shows the results of the two experiments. PSA cannot
train with unlabeled data, so in experiment 1 we use only the
10% labelled portion of the dataset, 5998 attack flows and 595
normal flows, for the training phase. Meanwhile, the two other
algorithms use 100% of the dataset for training, of which the
90% unlabelled portion is used for S4VM. The results of this
experiment strongly confirm the effectiveness of PSA in
comparison with the methods proposed in [35].

In experiment 2, we compare the performance of PSA with that of
several other algorithms run in WEKA 3 with their default
parameters. PSA achieves the highest accuracy (0.9257) and the
lowest reported FAR (0.1945) in Table 1, and its detection rate
(0.9974) is within 0.3 percentage points of the best. Some SVMs
achieve detection rates of 100%, but their FARs are very high,
between roughly 72% and 100%. The poor FARs of these algorithms
mean that they cannot verify the nature of benign traffic in the
dataset, whereas PSA can.


Table 1. Comparison between PSA and other algorithms

Algorithms                       ACC      DR       FAR

Experiment 1
PSA (10% labelled data)          0.9781   0.9626   0.0157
S4VM (10% labelled data) [35]    0.9196   -        0.0384
EBP-based MEP [35]               0.9655   -        0.0315

Experiment 2
PSA                              0.9257   0.9974   0.1945
Naive Bayes                      0.6832   0.9972   0.8668
SVM (linear)                     0.7315   1.0000   0.7185
SVM (polynomial)                 0.7143   0.8788   0.5613
SVM (RBF)                        0.6263   1.0000   0.9998
SVM (sigmoid)                    0.6263   1.0000   1.0000
Deep learning                    0.8106   0.8201   NaN
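For reference, the three measures in Table 1 can be computed from a confusion matrix as sketched below. The formulas are the conventional definitions of accuracy, detection rate, and false alarm rate, with attacks as the positive class; the paper does not spell them out, so they are an assumption here, and the counts in the usage line are made up for illustration.

```python
def rates(tp, tn, fp, fn):
    """Return (ACC, DR, FAR) from confusion-matrix counts.

    tp: attacks flagged as attacks      fn: attacks missed
    tn: normal flows passed as normal   fp: normal flows flagged as attacks
    A rate whose denominator is zero is returned as NaN.
    """
    nan = float("nan")
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else nan
    dr = tp / (tp + fn) if (tp + fn) else nan
    far = fp / (fp + tn) if (fp + tn) else nan
    return acc, dr, far


# Hypothetical counts, not taken from the experiments.
print(rates(tp=90, tn=80, fp=20, fn=10))  # (0.85, 0.9, 0.2)
```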


Recent work focuses on deep learning, so we conducted another
experiment using deep learning. In this experiment, we use one
hidden layer with 100 units, the ReLU activation function, the
Adam optimizer, a learning rate of 0.001, and 200 epochs. The
last row of Table 1 shows that both ACC and DR are lower than
those of our PSA.

IV. Conclusions 

The major contribution of this study is a ring representation,
used instead of a linear one, for better performance in terms
of both detection rate and accuracy rate.

To verify the effectiveness of the proposed approach, two
different datasets are used. The experimental results indicate
that the proposed approach can produce competitive and
consistent classification performance on real datasets.
Moreover, the results from experiment 1, which uses only the
10% labelled portion of the dataset for training, confirm that
PSA can detect anomalies with a small amount of labelled data.

The algorithm can be applied to a dataset that has the
following characteristics: (1) its strings have an equal number
of features, and (2) the value domain of every feature is
discrete. If a domain is continuous, such as the real numbers,
a suitable quantization may be applied before using our
algorithm.

In the future, we plan to combine our algorithms with machine
learning methods to improve detection performance and reduce
training time. Moreover, it would be interesting to develop
techniques for choosing optimal parameters and to integrate
them into new objective functions. We would also like to apply
the method to other security problems with larger datasets.

To the best of our knowledge, there has not been any published
attempt to use a ring type of data representation instead of a
linear one to attain more exact classification.